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Abstract 

We report on work in progress on extract- 
ing lexical simplifications (e.g., "collaborate" 
— > "work together'), focusing on utilizing 
edit histories in Simple English Wikipedia for 
this task. We consider two main approaches: 
(1) deriving simplification probabilities via an 
edit model that accounts for a mixture of dif- 
ferent operations, and (2) using metadata to 
focus on edits that are more likely to be sim- 
plification operations. We find our methods 
to outperform a reasonable baseline and yield 
many high-quality lexical simplifications not 
included in an independently-created manu- 
ally prepared list. 

Published at: NAACL 2010 (short paper) 
1 Introduction 

Nothing is more simple than greatness; indeed, to be 
simple is to be great. — Emerson, Literary Ethics 

Style is an important aspect of information pre- 
sentation; indeed, different contexts call for differ- 
ent styles. Here, we consider an important dimen- 
sion of style, namely, simplicity. Systems that can 
rewrite text into simpler versions promise to make 
information available to a broader audience, such as 
non-native speakers, children, laypeople, and so on. 

One major effort to produce such text is the 
Simple English Wikipedia (henceforth SimpleEW) 1 , 
a sort of spin-off of the well-known English 
Wikipedia (henceforth ComplexEW) where hu- 
man editors enforce simplicity of language through 
rewriting. The crux of our proposal is to learn lexical 
simplifications from SimpleEW edit histories, thus 
leveraging the efforts of the 1 8K pseudonymous in- 
dividuals who work on SimpleEW. Importantly, not 
all the changes on SimpleEW are simplifications; we 
thus also make use of ComplexEW edits to filter out 
non-simplifications. 

Related work and related problems Previous 
work usually involves general syntactic-level trans- 

'http : / / simple .wikipedia. org 



formation rules [1,9, 10]. In contrast, we explore 
data-driven methods to learn lexical simplifications 
(e.g., "collaborate" — ► "work together"), which are 
highly specific to the lexical items involved and thus 
cannot be captured by a few general rules. 

Simplification is strongly related to but distinct 
from paraphrasing and machine translation (MT). 
While it can be considered a directional form of 
the former, it differs in spirit because simplification 
must trade off meaning preservation (central to para- 
phrasing) against complexity reduction (not a con- 
sideration in paraphrasing). Simplification can also 
be considered to be a form of MT in which the two 
"languages" in question are highly related. How- 
ever, note that ComplexEW and SimpleEW do not 
together constitute a clean parallel corpus, but rather 
an extremely noisy comparable corpus. For ex- 
ample, Complex/Simple same-topic document pairs 
are often written completely independently of each 
other, and even when it is possible to get good 
sentence alignments between them, the sentence 
pairs may reflect operations other than simplifica- 
tion, such as corrections, additions, or edit spam. 

Our work joins others in using Wikipedia revi- 
sions to learn interesting types of directional lexical 
relations, e.g, "eggcorns" 3 [7] and entailments [8]. 

2 Method 

As mentioned above, a key idea in our work is to 
utilize SimpleEW edits. The primary difficulty in 
working with these modifications is that they include 
not only simplifications but also edits that serve 
other functions, such as spam removal or correction 
of grammar or factual content ("fixes"). We describe 
two main approaches to this problem: a probabilis- 
tic model that captures this mixture of different edit 
operations (§2.1), and the use of metadata to filter 
out undesirable revisions (§2.2). 

2 One exception [5] changes verb tense and replaces pro- 
nouns. Other lexical-level work focuses on medical text [4, 2], 
or uses frequency-filtered WordNet synonyms [3]. 

"A type of lexical corruption, e.g., "acorn"— >-"eggcorn". 



2.1 Edit model 

We say that the k th article in a Wikipedia corre- 
sponds to (among other things) a title or topic (e.g., 
"Cat") and a sequence dk of article versions caused 
by successive edits. For a given lexical item or 
phrase A, we write A E dk if there is any version 
in dk that contains A. From each dk we extract a 
collection e& = (e^i, ek,2, ■ ■ ■ , ^k,n k ) of lexical edit 
instances, repeats allowed, where e^j = A — >■ a 
means that phrase A in one version was changed to 
a in the next, 4 / b; e.g., "stands for" — > "is the 
same as". (We defer detailed description of how we 
extract lexical edit instances from data to §3.1.) We 
denote the collection of dk in ComplexEW and Sim- 
pleEW as C and S, respectively. 

There are at least four possible edit operations: fix 
(oi), simplify (02), no-op (03), or spam (04). How- 
ever, for this initial work we assume ^(04) = 0. 4 

Let P(oi I A) be the probability that o« is applied 
to A, and P(a \ A,Oi) be the probability of A — >■ a 
given that the operation is Oj. The key quantities of 
interest are P(o2 \ A) in S, which is the probability 
that A should be simplified, and P{a \ A,02), which 
yields proper simplifications of A. We start with an 
equation that models the probability that a phrase A 
is edited into a: 

P(a\A)=Y J P(.Oi\A)P{a\A,o i ), (1) 

where $7 is the set of edit operations. This involves 
the desired parameters, which we solve for by esti- 
mating the others from data, as described next. 

Estimation Note that P(a \ A, o 3 ) = if A ^ a. 
Thus, if we have estimates for o\ -related probabili- 
ties, we can derive 02-related probabilities via Equa- 
tion 1. To begin with, we make the working as- 
sumption that occurrences of simplification in Com- 
plexEW are negligible in comparison to fixes. Since 
we are also currently ignoring edit spam, we thus 
assume that only o\ edits occur in ComplexEW. 5 

Let fc{A) be the fraction of dk in C 
containing A in which A is modified: 

\{d k eC\3a,i SUCh that e M =A->q}| 

\{d k eC\Aed k }\ 



fc(A) 



4 Spam/vandalism detection is a direction for future work. 

5 This assumption also provides useful constraints to EM, 
which we plan to apply in the future, by reducing the number of 
parameter settings yielding the same likelihood. 



We similarly define fs(A) on dk in S. Note that we 
count topics (version sequences), not individual ver- 
sions: if A appears at some point and is not edited 
until 50 revisions later, we should not conclude 
that A is unlikely to be rewritten; for example, the 
intervening revisions could all be minor additions, 
or part of an edit war. 

If we assume that the probability of any particular 
fix operation being applied in SimpleEW is propor- 
tional to that in ComplexEW — e.g., the SimpleEW 
fix rate might be dampened because already-edited 
ComplexEW articles are copied over — we have 6 

P(oi I A) = afc(A) 
where < a < 1. Note that in SimpleEW, 

P( 0l Vo 2 I A) = P(oi I A) + P(o 2 I A), 

where P(o\ V 02 \ A) is the probability that A is 
changed to a different word in SimpleEW, which we 
estimate as P{o\ V 02 \ A) = fs(A). We then set 



P(o 2 I A) = max(0,f s (A) -afc(A)). 



Next, under our working assumption, we estimate 
the probability of A being changed to a as a fix 
by the proportion of ComplexEW edit instances that 
rewrite A to a: 



P(a I A, 01) 



\{{k, i) pairs | e k ,i = A ->■ o A d k G C}\ 
Eo' \i( k > *) P airs I e M = A -> a' A d k e C}\ 



A natural estimate for the conditional probability 
of A being rewritten to a under any operation type 
is based on observations of A — > a in SimpleEW, 
since that is the corpus wherein both operations are 
assumed to occur: 



P(a I A) 



\{(k, i) pairs | ek,i = A — > a A dk G S}\ 
£ a , \{{k,i) pairs I e M = A -> a' A d k G S}\ 

Thus, from (1) we get that for A / a: 



P = P(a|A)-P(o 1 |A)P(a|A,o 1 ) 

P(o 2 |A) 



2.2 Metadata-based methods 

Wiki editors have the option of associating a com- 
ment with each revision, and such comments some- 
times indicate the intent of the revision. We there- 
fore sought to use comments to identify "trusted" 



Throughout, "hats" denote estimates. 



revisions wherein the extracted lexical edit instances 
(see §3.1) would be likely to be simplifications. 

Let Ffc = (r k , . . . , r l k , . . . ) be the sequence of revi- 
sions for the k th article in SimpleEW, where r k is the 
set of lexical edit instances (A — > a) extracted from 
the i th modification of the document. Let c l k be the 
comment that accompanies r k , and conversely, let 
R(Set) = {r k \c k G Set}. 

We start with a seed set of trusted comments, 
Seed. To initialize it, we manually inspected a small 
sample of the 700K+ SimpleEW revisions that bear 
comments, and found that comments containing a 
word matching the regular expression *simpl* (e.g, 
"simplify") seem promising. We thus set Seed := 
{ * simpl*} (abusing notation). 

The Simpl method Given a set of trusted revi- 
sions TRev (in our case TRev = R(Seed)), we 
score each A — >■ a G TRev by the point-wise mu- 
tual information (PMI) between A and a. 7 We write 
RANK(TRev) to denote the PMI-based ranking of 
A — > a G TRev, and use Simpl to denote our most 
basic ranking method, RANK(R(Seed)). 

Two ideas for bootstrapping We also considered 
bootstrapping as a way to be able to utilize revisions 
whose comments are not in the initial Seed set. 

Our first idea was to iteratively expand the set 
of trusted comments to include those that most of- 
ten accompany already highly ranked simplifica- 
tions. Unfortunately, our initial implementations in- 
volved many parameters (upper and lower comment- 
frequency thresholds, number of highly ranked sim- 
plifications to consider, number of comments to add 
per iteration), making it relatively difficult to tune; 
we thus omit its results. 

Our second idea was to iteratively expand the 
set of trusted revisions, adding those that contain 
already highly ranked simplifications. While our 
initial implementation had fewer parameters than 
the method sketched above, it tended to terminate 
quickly, so that not many new simplifications were 
found; so, again, we do not report results here. 

An important direction for future work is to differ- 
entially weight the edit instances within a revision, 
as opposed to placing equal trust in all of them; this 

7 PMI seemed to outperform raw frequency and conditional 
probability. 



could prevent our bootstrapping methods from giv- 
ing common fixes (e.g., "a" — > "the") high scores. 

3 Evaluation 8 

3.1 Data 

We obtained the revision histories of both Sim- 
pleEW (November 2009 snapshot) and ComplexEW 
(January 2008 snapshot). In total, ~1.5M revisions 
for 81733 SimpleEW articles were processed (only 
30% involved textual changes). For ComplexEW, 
we processed ~16M revisions for 19407 articles. 

Extracting lexical edit instances. For each ar- 
ticle, we aligned sentences in each pair of adja- 
cent versions using tf-idf scores in a way simi- 
lar to Nelken and Shieber [6] (this produced sat- 
isfying results because revisions tended to repre- 
sent small changes). From the aligned sentence 
pairs, we obtained the aforementioned lexical edit 
instances A — > a. Since the focus of our study 
was not word alignment, we used a simple method 
that identified the longest differing segments (based 
on word boundaries) between each sentence, except 
that to prevent the extraction of entire (highly non- 
matching) sentences, we filtered out A — > a pairs if 
either A or a contained more than five words. 

3.2 Comparison points 

Baselines Random returns lexical edit instances 
drawn uniformly at random from among those ex- 
tracted from SimpleEW. FREQUENT returns the 
most frequent lexical edit instances extracted from 
SimpleEW. 

Dictionary of simplifications The SimpleEW ed- 
itor "Spencerk" (Spencer Kelly) has assembled a list 
of simple words and simplifications using a combi- 
nation of dictionaries and manual effort 9 . He pro- 
vides a list of 17,900 simple words — words that do 
not need further simplification — and a list of 2000 
transformation pairs. We did not use Spencerk's set 
as the gold standard because many transformations 
we found to be reasonable were not on his list. In- 
stead, we measured our agreement with the list of 
transformations he assembled (SpList). 

"Results at http://www.cs.cornell.edu/home/llee/data/simple 
9 http://www.spencerwaterbed.com/soft/simple/about.html 



3.3 Preliminary results 

The top 100 pairs from each system (edit model 10 
and SlMPL and the two baselines) plus 100 ran- 
domly selected pairs from SpList were mixed and 
all presented in random order to three native English 
speakers and three non-native English speakers (all 
non-authors). Each pair was presented in random 
orientation (i.e., either as A — >■ a or as a — >■ A), 
and the labels included "simpler", "more complex", 
"equal", "unrelated", and"?" ("hard to judge"). The 
first two labels correspond to simplifications for the 
orientations A —> a and a — > A, respectively. Col- 
lapsing the 5 labels into "simplification", "not a sim- 
plification", and "?" yields reasonable agreement 
among the 3 native speakers (k = 0.69; 75.3% of the 
time all three agreed on the same label). While we 
postulated that non-native speakers 11 might be more 
sensitive to what was simpler, we note that they dis- 
agreed more than the native speakers (k = 0.49) and 
reported having to consult a dictionary. The native- 
speaker majority label was used in our evaluations. 

Here are the results; "-x-y" means that x and y are 
the number of instances discarded from the precision 
calculation for having no majority label or majority 
label "?", respectively: 



Method 


Prec@100 


# of pairs 


SpList 


86% (-0-0) 


2000 


Edit model 


77% (-0-1) 


1079 


SlMPL 


66% (-0-0) 


2970 


Frequent 


17% (-1-7) 




Random 


17% (-1-4) 





Both baselines yielded very low precisions — 
clearly not all (frequent) edits in SimpleEW were 
simplifications. Furthermore, the edit model yielded 
higher precision than SlMPL for the top 100 pairs. 
(Note that we only examined one simplification per 
A for those A where P(o2 \ A) was well-defined; 
thus "# of pairs" does not directly reflect the full 
potential recall that either method can achieve.) 
Both, however, produced many high-quality pairs 
(62% and 71% of the correct pairs) not included in 
SpList. We also found the pairs produced by these 
two systems to be complementary to each other. We 

10 We only considered those A such that freq(A — > *) > 
1 A freq(A) > 100 on both SimpleEW and ComplexEW. The 
final top 100 A — >■ a pairs were those with As with the highest 
P(o 2 | A). We set q = 1. 

"Native languages: Russian; Russian; Russian and Kazakh. 



believe that these two approaches provide a good 
starting point for further explorations. 

Finally, some examples of simplifications found 
by our methods: "stands for" — s- "is the same 
as", "indigenous" — > "native", "permitted" —?■ "al- 
lowed", "concealed" — > "hidden", "collapsed" — > 
"fell down", "annually" — > "every year". 

3.4 Future work 

Further evaluation could include comparison with 
machine-translation and paraphrasing algorithms. It 
would be interesting to use our proposed estimates 
as initialization for EM-style iterative re-estimation. 
Another idea would be to estimate simplification pri- 
ors based on a model of inherent lexical complexity; 
some possible starting points are number of sylla- 
bles (which is used in various readability formulae) 
or word length. 
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